@LHT129 LHT129 commented Oct 29, 2025

Summary by Sourcery

Add JSON-based statistics support to datasets and search operations to record and retrieve an "is_timeout" flag for search results.

New Features:

  • Add Dataset::Stats setter and GetStats getters to store and retrieve stats as JSON strings or by keys
  • Record an "is_timeout" boolean flag in search parameters and include it in the dataset stats output

Enhancements:

  • Extend InnerSearchParam with a shared JSON object for stats initialization
  • Refactor search routines to initialize, update, and dump stats into the dataset

Tests:

  • Add assertions in TestSearchOvertime to validate the presence and correctness of the "is_timeout" stat

sourcery-ai bot commented Oct 29, 2025

Reviewer's Guide

This PR instruments search routines with an “is_timeout” statistic by extending the Dataset API to carry arbitrary stats, adding a stats JSON to InnerSearchParam, setting the timeout flag during search, and exposing the stats via new Dataset methods.

Sequence diagram for search routine timeout stat propagation

sequenceDiagram
participant SearchWithRequest
participant InnerSearchParam
participant BasicSearcher_search_impl
participant DatasetImpl

SearchWithRequest->>InnerSearchParam: create and initialize stats
SearchWithRequest->>BasicSearcher_search_impl: pass InnerSearchParam
BasicSearcher_search_impl->>InnerSearchParam: set stats["is_timeout"] = true if timeout
SearchWithRequest->>DatasetImpl: call Stats(stats.dump())
DatasetImpl->>DatasetImpl: store stats_ string
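
The flow in the sequence diagram can be sketched with the standard library alone. This is a minimal illustration, not the PR's code: `SearchParamSketch`, `DumpStats`, `SearchImplSketch`, and `SearchWithRequestSketch` are hypothetical names, a `std::map<std::string, bool>` stands in for the JSON object, and comparing elapsed time against `max_time_cost_ms` is an assumption about how the timeout is detected.

```cpp
#include <chrono>
#include <map>
#include <memory>
#include <string>

// Hypothetical stand-in for InnerSearchParam. The stats object is held through a
// shared_ptr, so a copy of the parameter updates the same record.
struct SearchParamSketch {
    double max_time_cost_ms = 0.0;
    std::shared_ptr<std::map<std::string, bool>> stats =
        std::make_shared<std::map<std::string, bool>>();
};

// Serializes the flat stats record as a JSON object (stand-in for stats.dump()).
// Keys are written without escaping, which is enough for this illustration.
std::string DumpStats(const std::map<std::string, bool>& stats) {
    std::string out = "{";
    bool first = true;
    for (const auto& entry : stats) {
        if (!first) {
            out += ",";
        }
        out += "\"" + entry.first + "\":" + (entry.second ? "true" : "false");
        first = false;
    }
    return out + "}";
}

// Stand-in for the searcher: takes the parameter by value and flags a timeout
// once the elapsed time exceeds the budget (assumed detection rule).
void SearchImplSketch(SearchParamSketch param, int steps) {
    using Clock = std::chrono::steady_clock;
    const auto start = Clock::now();
    for (int i = 0; i < steps; ++i) {
        const std::chrono::duration<double, std::milli> elapsed = Clock::now() - start;
        if (elapsed.count() > param.max_time_cost_ms) {
            (*param.stats)["is_timeout"] = true;
            return;
        }
        // One unit of search work would go here.
    }
}

// Stand-in for SearchWithRequest: initialize stats, run the search, dump stats.
std::string SearchWithRequestSketch(double max_time_cost_ms, int steps) {
    SearchParamSketch param;
    param.max_time_cost_ms = max_time_cost_ms;
    (*param.stats)["is_timeout"] = false;
    SearchImplSketch(param, steps);
    return DumpStats(*param.stats);
}
```

Because the searcher receives a copy of the parameter, the flag only reaches the caller if the stats object is shared, which is one plausible reason for the `std::shared_ptr` member.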

ER diagram for Dataset stats storage

erDiagram
DATASET {
    string stats_
}
INNER_SEARCH_PARAM {
    json stats
}
DATASET ||--o{ INNER_SEARCH_PARAM : "stores stats from"

Class diagram for updated Dataset and DatasetImpl stats API

classDiagram
class Dataset {
    <<interface>>
    +GetExtraInfoSize() int64_t
    +Stats(stats: std::string) DatasetPtr
    +GetStats() std::string
    +GetStats(stat_keys: std::vector<std::string>) std::vector<std::string>
}
class DatasetImpl {
    +Stats(stats: std::string) DatasetPtr
    +GetStats() std::string
    +GetStats(stat_keys: std::vector<std::string>) std::vector<std::string>
    -stats_ std::string
}
DatasetImpl --|> Dataset
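
A chaining setter of the shape shown in the class diagram might look like the following sketch. `DatasetSketch` and its members are hypothetical and only mirror the diagram. The `"{}"` default for `stats_` is an assumption suggested by the test's `GetStats() != "{}"` check, and `Owner` is included only because the test snippet chains it.

```cpp
#include <memory>
#include <string>
#include <utility>

class DatasetSketch;
using DatasetSketchPtr = std::shared_ptr<DatasetSketch>;

// Hypothetical stand-in for DatasetImpl. Each setter returns the owning
// shared_ptr so that calls can be chained.
class DatasetSketch : public std::enable_shared_from_this<DatasetSketch> {
public:
    // shared_from_this() requires the object to be owned by a shared_ptr,
    // so instances are created through this factory.
    static DatasetSketchPtr Make() {
        return std::make_shared<DatasetSketch>();
    }

    // Stores the serialized stats string and returns this dataset.
    DatasetSketchPtr Stats(std::string stats) {
        stats_ = std::move(stats);
        return shared_from_this();
    }

    // Second chaining setter, mirroring the Owner(false) call in the test.
    DatasetSketchPtr Owner(bool is_owner) {
        owner_ = is_owner;
        return shared_from_this();
    }

    std::string GetStats() const {
        return stats_;
    }

    bool IsOwner() const {
        return owner_;
    }

private:
    std::string stats_ = "{}";  // assumed default: an empty JSON object
    bool owner_ = true;
};
```

With this shape, `DatasetSketch::Make()->Stats(R"({"is_timeout":false})")->Owner(false)` yields a dataset whose `GetStats()` returns the stored string unchanged.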

Class diagram for InnerSearchParam stats addition

classDiagram
class InnerSearchParam {
    +stats std::shared_ptr<JsonType>
    +InnerSearchParam()
    +operator=(other: InnerSearchParam) InnerSearchParam&
    ...
}

File-Level Changes

| Change | Details | Files |
| --- | --- | --- |
| Extend Dataset API to store and retrieve stats | • Added virtual Stats, GetStats(), and GetStats(keys) methods to Dataset interface<br>• Implemented these methods and a stats_ member in DatasetImpl<br>• Stats() returns shared pointer for chaining | include/vsag/dataset.h<br>src/dataset_impl.h<br>src/dataset_impl.cpp |
| Add stats JSON container to InnerSearchParam | • Introduced shared_ptr stats and initialized in constructor | src/impl/inner_search_param.h |
| Record and propagate “is_timeout” in search flows | • Initialize stats map on search start<br>• Set stats["is_timeout"] on timeout in BasicSearcher<br>• Attach serialized stats to Dataset results in HGraph | src/algorithm/hgraph.cpp<br>src/impl/basic_searcher.cpp |
| Introduce tests for the new stats feature | • Verify GetStats() returns non-empty JSON<br>• Verify GetStats({"is_timeout"}) returns a boolean string | tests/test_index.cpp |

@sourcery-ai sourcery-ai bot left a comment

Hey there - I've reviewed your changes and they look great!

Prompt for AI Agents
Please address the comments from this code review:

## Individual Comments

### Comment 1
<location> `src/dataset_impl.cpp:36` </location>
<code_context>

+std::vector<std::string>
+DatasetImpl::GetStats(const std::vector<std::string>& stat_keys) const {
+    auto json = JsonType::parse(this->stats_);
+    std::vector<std::string> result;
+    for (const auto& key : stat_keys) {
</code_context>

<issue_to_address>
**issue:** Consider error handling for invalid JSON in stats_.

If stats_ is malformed, JsonType::parse may throw or behave unpredictably. Adding validation or error handling would improve robustness, especially if stats_ is externally set.
</issue_to_address>
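
One way to act on this comment is to catch the parse failure and return an empty string per requested key. If `JsonType` is an alias for `nlohmann::json` (the page does not say), `parse` throws a `parse_error` exception on malformed input by default, so a `try`/`catch` around the call would be the natural guard. The sketch below shows that pattern with the standard library only: `ParseFlatBoolObject` is a hypothetical, deliberately tiny stand-in for `JsonType::parse` that accepts only flat objects with `true`/`false` values, and `GetStatsSafe` is a hypothetical free-function version of `GetStats(keys)`.

```cpp
#include <cctype>
#include <cstddef>
#include <map>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for JsonType::parse. Accepts only a flat object whose
// values are the literals true or false, such as {"is_timeout":true}, and
// throws std::invalid_argument on anything else.
std::map<std::string, bool> ParseFlatBoolObject(const std::string& text) {
    std::map<std::string, bool> out;
    std::size_t i = 0;
    auto skip_ws = [&]() {
        while (i < text.size() && std::isspace(static_cast<unsigned char>(text[i]))) {
            ++i;
        }
    };
    auto expect = [&](char c) {
        skip_ws();
        if (i >= text.size() || text[i] != c) {
            throw std::invalid_argument("malformed stats");
        }
        ++i;
    };
    expect('{');
    skip_ws();
    if (i < text.size() && text[i] == '}') {
        ++i;  // empty object
    } else {
        while (true) {
            expect('"');
            const std::size_t end = text.find('"', i);
            if (end == std::string::npos) {
                throw std::invalid_argument("unterminated key");
            }
            const std::string key = text.substr(i, end - i);
            i = end + 1;
            expect(':');
            skip_ws();
            if (text.compare(i, 4, "true") == 0) {
                out[key] = true;
                i += 4;
            } else if (text.compare(i, 5, "false") == 0) {
                out[key] = false;
                i += 5;
            } else {
                throw std::invalid_argument("unsupported value");
            }
            skip_ws();
            if (i < text.size() && text[i] == ',') {
                ++i;
                continue;
            }
            expect('}');
            break;
        }
    }
    skip_ws();
    if (i != text.size()) {
        throw std::invalid_argument("trailing characters");
    }
    return out;
}

// Returns one string per requested key: "true" or "false" when the key is
// present, and "" when the key is missing or the stats string cannot be parsed.
std::vector<std::string> GetStatsSafe(const std::string& stats,
                                      const std::vector<std::string>& keys) {
    std::vector<std::string> result(keys.size(), "");
    std::map<std::string, bool> parsed;
    try {
        parsed = ParseFlatBoolObject(stats);
    } catch (const std::invalid_argument&) {
        return result;  // malformed input: every key maps to ""
    }
    for (std::size_t k = 0; k < keys.size(); ++k) {
        const auto it = parsed.find(keys[k]);
        if (it != parsed.end()) {
            result[k] = it->second ? "true" : "false";
        }
    }
    return result;
}
```

Returning empty strings keeps the result the same length as the key list. Whether the real method should swallow the error or report it to the caller is a design decision this sketch does not settle.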

### Comment 2
<location> `tests/test_index.cpp:2124-2130` </location>
<code_context>
             ->Owner(false);
         auto res = index->KnnSearch(query, 10, search_param);
         REQUIRE(res.has_value());
+        auto result = res.value();
+        REQUIRE(result->GetStats() != "{}");
+        auto stats = result->GetStats({"is_timeout"});
+        REQUIRE(stats.size() == 1);
+        bool has_timeout_result = (stats[0] == "true" or stats[0] == "false");
+        REQUIRE(has_timeout_result);
     }
 }
</code_context>

<issue_to_address>
**suggestion (testing):** Consider adding tests for edge cases where the stats string is malformed or missing keys.

Adding tests for malformed stats strings, missing keys, or unexpected values will improve GetStats error handling and robustness.

```suggestion
        REQUIRE(result->GetStats() != "{}");
        auto stats = result->GetStats({"is_timeout"});
        REQUIRE(stats.size() == 1);
        bool has_timeout_result = (stats[0] == "true" or stats[0] == "false");
        REQUIRE(has_timeout_result);

        // Edge case: malformed stats string
        {
            struct MalformedResult {
                std::string GetStats() const { return "{is_timeout: true"; } // missing closing brace, not valid JSON
                std::vector<std::string> GetStats(const std::vector<std::string>& keys) const {
                    // Simulate error handling: return empty string for malformed
                    return std::vector<std::string>(keys.size(), "");
                }
            } malformed_result;
            auto malformed_stats = malformed_result.GetStats({"is_timeout"});
            REQUIRE(malformed_stats.size() == 1);
            REQUIRE(malformed_stats[0] == "");
        }

        // Edge case: missing key in stats
        {
            struct MissingKeyResult {
                std::string GetStats() const { return R"({"other_key": "true"})"; }
                std::vector<std::string> GetStats(const std::vector<std::string>& keys) const {
                    // Simulate missing key: return empty string for missing
                    std::vector<std::string> out;
                    for (const auto& k : keys) {
                        out.push_back(k == "other_key" ? "true" : "");
                    }
                    return out;
                }
            } missing_key_result;
            auto missing_stats = missing_key_result.GetStats({"is_timeout"});
            REQUIRE(missing_stats.size() == 1);
            REQUIRE(missing_stats[0] == "");
        }

        // Edge case: unexpected value
        {
            struct UnexpectedValueResult {
                std::string GetStats() const { return R"({"is_timeout": "maybe"})"; }
                std::vector<std::string> GetStats(const std::vector<std::string>& keys) const {
                    // Simulate unexpected value
                    std::vector<std::string> out;
                    for (const auto& k : keys) {
                        out.push_back(k == "is_timeout" ? "maybe" : "");
                    }
                    return out;
                }
            } unexpected_value_result;
            auto unexpected_stats = unexpected_value_result.GetStats({"is_timeout"});
            REQUIRE(unexpected_stats.size() == 1);
            REQUIRE((unexpected_stats[0] != "true" && unexpected_stats[0] != "false"));
        }
    }
}
```
</issue_to_address>




- implement for hgraph & ivf
- introduce new search param: "max_time_cost_ms" (double)
- update doc for ivf

Signed-off-by: LHT129 <tianlan.lht@antgroup.com>